vision and language
From Visual Question Answering to multimodal learning: an interview with Aishwarya Agrawal
You were awarded an Honourable Mention for the 2019 AAAI / ACM SIGAI Doctoral Dissertation Award. What was the topic of your dissertation research, and what were the main contributions or findings?
My PhD dissertation was on the topic of Visual Question Answering (VQA). We proposed the task of open-ended and free-form VQA - a new way to benchmark computer vision models by asking them questions about images - and we curated a large-scale dataset for researchers to train and test their models on this task.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > India > Gujarat > Gandhinagar (0.04)
- Personal > Interview (0.65)
- Research Report > New Finding (0.47)
- Personal > Honors > Award (0.35)
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis
Deep network models are often purely inductive during both training and inference on unseen data. When these models are used for prediction, they may fail to capture important semantic information and implicit dependencies within datasets. Recent advancements have shown that combining multiple modalities in large-scale vision and language settings can improve understanding and generalization performance. However, as the model size increases, fine-tuning and deployment become computationally expensive, even for a small number of downstream tasks. Moreover, it is still unclear how domain or prior modal knowledge can be specified in a backpropagation-friendly manner, especially in large-scale and noisy settings.
Reviews: Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
The paper proposes a new method called Mutual Iterative Attention (MIA) for improving the representations used by common visual-question-answering and image-captioning models. MIA works by repeated execution of 'mutual attention', a computation similar to the self-attention operation in the Transformer model, but where the lookup ('query') representation is conditioned on information from the other modality. Importantly, the two modalities involved in the MIA operation are not vision and language; they are vision and 'textual concepts' (which the paper also calls 'textual words' and 'visual words' at various points). These are actual words referring to objects that can be found in the image. The model that predicts textual concepts (the 'visual words' extractor) is trained on the MS-COCO dataset in a separate optimization from the captioning model. Applying MIA to a range of models before attempting VQA or captioning tasks improves the scores, in some cases above the state of the art. It is a strength of this paper that the authors apply their method to a wide range of existing models and observe consistent improvements.
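To make the mechanism concrete, here is a minimal sketch of the 'mutual attention' computation as the review describes it: standard scaled dot-product attention in which each modality's queries attend over the other modality, applied iteratively. The function names, residual updates and tensor sizes are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of one round of "mutual attention": each modality's
# queries are conditioned on the other modality, as the review describes.
# Function and variable names are hypothetical, not taken from the paper.
import torch
import torch.nn.functional as F

def cross_attend(queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention where `queries` come from one modality
    and keys/values come from the other (`context`)."""
    d = queries.size(-1)
    scores = queries @ context.transpose(-2, -1) / d ** 0.5   # (..., Nq, Nc)
    weights = F.softmax(scores, dim=-1)
    return weights @ context                                   # (..., Nq, d)

def mutual_iteration(visual: torch.Tensor, textual: torch.Tensor):
    """One iteration: refine visual features using the textual concepts, then
    refine the textual-concept features using the refined visual features."""
    visual_refined = visual + cross_attend(visual, textual)    # residual update
    textual_refined = textual + cross_attend(textual, visual_refined)
    return visual_refined, textual_refined

# Toy shapes: 36 region features and 10 textual-concept embeddings, dim 512.
v = torch.randn(1, 36, 512)
t = torch.randn(1, 10, 512)
for _ in range(3):   # repeated execution, as in the review's description
    v, t = mutual_iteration(v, t)
```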
Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time
Masala, Mihai, Leordeanu, Marius
Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is largely still beyond our reach. Moreover, such models suffer from overfitting, such that once given a video from an unseen context or distribution the quality and accuracy of the description drops, as our evaluations prove. On the other hand, VLLMs have shown impressive results, being capable of generating long, rich descriptions of videos. Unfortunately, VLLMs still share some of the same weaknesses as previous methods: they are unexplainable and they still rely on sampling frames to process a video. Moreover, top-performing models such as GPT, Claude or Gemini are not open and are only accessible via a paid API. We argue that one of the main reasons why this interdisciplinary, cross-domain task is still far from being solved is that we still lack an explainable way to bridge this apparently insurmountable gap. Explainability could provide a more analytical and stage-wise way to make the transition from vision to language that is both trustworthy and … In this work, we propose a common ground between vision and language based on events in space and time, in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. Bleu, ROUGE) and the modern LLM-as-a-Jury approach.
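The 'events in space and time' representation can be pictured with a small data structure. The sketch below is a generic illustration under assumed field names (action, actors, time interval, location, relation); it is not the representation used in the paper.

```python
# A minimal, generic sketch of a graph of events in space and time:
# nodes are events with actors, a time interval and a location; edges carry
# a relation such as "before" or "causes". Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Event:
    action: str
    actors: tuple[str, ...]
    t_start: float          # seconds into the video
    t_end: float
    location: str

@dataclass
class EventGraph:
    events: list[Event] = field(default_factory=list)
    edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, relation)

    def add(self, event: Event) -> int:
        self.events.append(event)
        return len(self.events) - 1

    def relate(self, src: int, dst: int, relation: str) -> None:
        self.edges.append((src, dst, relation))

# Toy story: two events linked by a temporal relation.
g = EventGraph()
a = g.add(Event("pick up cup", ("person",), 1.0, 2.5, "kitchen"))
b = g.add(Event("drink", ("person",), 2.5, 4.0, "kitchen"))
g.relate(a, b, "before")
```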
Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
Han, Soyeon Caren, Cao, Feiqi, Poon, Josiah, Navigli, Roberto
This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the foundational concepts of multimodality, the evolution of multimodal research, and the key technical challenges addressed by these models. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Additionally, the tutorial will delve into the intricacies of multimodal large models and instruction tuning strategies to optimise performance for specific tasks. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models, demonstrating real-world applications like visual storytelling and visual question answering. This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI. ACM Multimedia 2024 is the ideal venue for this tutorial, aligning perfectly with our goal of understanding multimodal pretrained and large language models, and their tuning mechanisms.
- Oceania > Australia > Victoria > Melbourne (0.06)
- Oceania > Australia > New South Wales > Sydney (0.05)
- North America > United States > New York > New York County > New York City (0.05)
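As a rough illustration of the kind of instruction-tuning data the tutorial refers to, here is one schematic vision-language instruction record; the keys and file path are hypothetical placeholders rather than any specific dataset's schema.

```python
# Schematic example of a single vision-language instruction-tuning record.
# Keys and the image path are illustrative placeholders, not a real schema.
record = {
    "image": "images/000123.jpg",   # path to the visual input
    "instruction": "Describe what the person in the photo is doing.",
    "response": "A person is riding a bicycle along a beach at sunset.",
}

# During tuning, the image is passed through a vision encoder while the
# instruction/response pair is tokenised; the model is trained to generate
# the response conditioned on both the image features and the instruction.
```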
Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond
Zhang, Xiang, Li, Senyu, Wu, Zijun, Shi, Ning
Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex text and image tasks. Numerous prior research endeavors have diligently examined the performance of these Vision Large Language Models (VLLMs) across tasks like object detection, image captioning and others. However, these analyses often focus on evaluating the performance of each modality in isolation, lacking insights into their cross-modal interactions. Specifically, questions concerning whether these vision-language models execute vision and language tasks consistently or independently have remained unanswered. In this study, we draw inspiration from recent investigations into multilingualism and conduct a comprehensive analysis of these models' cross-modal interactions. We introduce a systematic framework that quantifies the capability disparities between different modalities in the multimodal setting and provide a set of datasets designed for these evaluations. Our findings reveal that models like GPT-4V tend to perform consistently across modalities when the tasks are relatively simple. However, the trustworthiness of results derived from the vision modality diminishes as the tasks become more challenging. Expanding on our findings, we introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
- North America > Canada > Alberta (0.15)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Oceania > Australia (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.49)
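The abstract does not spell out how Vision Description Prompting works; one plausible reading is a two-stage prompt that first elicits a textual description of the image and then answers the question conditioned on that description. The sketch below illustrates that reading with a hypothetical `vlm_chat` callable standing in for any vision-language chat API; it is not the authors' implementation.

```python
# Hedged sketch of a two-stage "describe, then answer" prompting pipeline,
# one plausible reading of "Vision Description Prompting" from the abstract.
# `vlm_chat` is a hypothetical stand-in for any vision-language chat API.
from typing import Callable

def vision_description_prompting(
    vlm_chat: Callable[[str, str], str],   # (image_path, prompt) -> model reply
    image_path: str,
    question: str,
) -> str:
    # Stage 1: elicit a detailed textual description of the image.
    description = vlm_chat(image_path, "Describe this image in detail.")
    # Stage 2: answer the question, grounding the model in its own description.
    prompt = (
        f"Image description: {description}\n"
        f"Using the description above, answer the question: {question}"
    )
    return vlm_chat(image_path, prompt)

# Example with a dummy backend (no real API calls are made here).
def dummy_vlm(image_path: str, prompt: str) -> str:
    return f"[reply to: {prompt[:40]}...]"

print(vision_description_prompting(dummy_vlm, "photo.jpg", "How many chairs are visible?"))
```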
Explaining Vision and Language through Graphs of Events in Space and Time
Masala, Mihai, Cudlenco, Nicolae, Rebedea, Traian, Leordeanu, Marius
Artificial Intelligence is making great advances today and is starting to bridge the gap between vision and language. However, we are still far from understanding, explaining and controlling the visual content explicitly from a linguistic perspective, because we still lack a common explainable representation between the two domains. In this work we address this limitation and propose the Graph of Events in Space and Time (GEST), by which we can represent, create and explain both visual and linguistic stories. We provide a theoretical justification of our model and an experimental validation, which proves that GEST can bring solid complementary value alongside powerful deep learning models. In particular, GEST can help improve, at the content level, the generation of videos from text, by being easily incorporated into our novel video generation engine. Additionally, by using efficient graph matching techniques, the GEST graphs can also improve the comparisons between texts at the semantic level.
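As a toy illustration of comparing two stories through their event graphs, the sketch below scores similarity with networkx graph edit distance; this is a generic stand-in for the efficient graph matching techniques the abstract mentions, not the GEST implementation.

```python
# Generic illustration: compare two small event graphs with graph edit
# distance as a crude semantic similarity. networkx is an off-the-shelf
# stand-in; GEST's actual matching is more efficient and more structured.
import networkx as nx

def event_graph(events, relations):
    g = nx.DiGraph()
    for name, attrs in events.items():
        g.add_node(name, **attrs)
    for src, dst, rel in relations:
        g.add_edge(src, dst, relation=rel)
    return g

g1 = event_graph(
    {"e1": {"action": "open door"}, "e2": {"action": "enter room"}},
    [("e1", "e2", "before")],
)
g2 = event_graph(
    {"a": {"action": "open door"}, "b": {"action": "leave room"}},
    [("a", "b", "before")],
)

dist = nx.graph_edit_distance(
    g1, g2,
    node_match=lambda x, y: x["action"] == y["action"],
    edge_match=lambda x, y: x["relation"] == y["relation"],
)
print("edit distance:", dist)   # lower means the two stories are more similar
```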
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
Zhang, Chunhui, Sun, Xin, Liu, Li, Yang, Yiqian, Liu, Qiong, Zhou, Xi, Wang, Yanfeng
The current mainstream vision-language (VL) tracking framework consists of three parts: a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized and heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction and feature integration, resulting in extracted features that lack semantic guidance and have limited target-aware capability in complex scenarios, e.g. similar distractors and extreme illumination. In this work, inspired by the recent success of exploring foundation models with a unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve the learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e. OTB99-L, TNL2K, LaSOT, LaSOT_Ext and WebUAV-3M, demonstrate the superiority of the proposed tracker against existing state-of-the-art methods on VL tracking. Code will be made publicly available.
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
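A minimal sketch of the unified-backbone idea described in the All-in-One abstract above: language information is injected into the vision tokens, both token streams are concatenated, and a single shared Transformer processes them. The module names, the mean-pooled injection and all dimensions are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch: mix language information into vision tokens, concatenate
# both token streams, and process them with one shared Transformer encoder.
# Dimensions and module names are illustrative, not the released code.
import torch
import torch.nn as nn

class UnifiedVLBackbone(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        self.inject = nn.Linear(2 * dim, dim)   # fuses language into each vision token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Summarise the language description and inject it into every vision token.
        text_summary = text_tokens.mean(dim=1, keepdim=True)              # (B, 1, D)
        text_summary = text_summary.expand(-1, vision_tokens.size(1), -1)
        injected = self.inject(torch.cat([vision_tokens, text_summary], dim=-1))
        # Joint feature extraction and interaction in a single backbone.
        tokens = torch.cat([injected, text_tokens], dim=1)                # (B, Nv+Nt, D)
        return self.backbone(tokens)

# Toy usage: 196 vision tokens from a search frame, 12 text tokens from the description.
model = UnifiedVLBackbone()
out = model(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
print(out.shape)   # torch.Size([2, 208, 256])
```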